Abstract


A face identification algorithm is presented that automatically processes
an unknown image by locating and identifying the face. The heart of
the algorithm is the use of pursuit filters. A matching pursuit filter is an
adapted wavelet expansion, where the expansion is adapted to both the
data and the pattern recognition problem being addressed. For
identification, the filters find the features that differentiate among faces,
whereas for detection, the filters encode the similarities among faces. The
filters are designed through a simultaneous decomposition of a training set
into a two-dimensional wavelet expansion. This yields a representation that
is explicitly two-dimensional and encodes information locally.

The algorithm uses coarse-to-fine processing to locate a small set of key
facial features, which are restricted to the nose and eye regions of the face.
The result is an algorithm that is robust to variations in facial expression,
hair style, and the surrounding environment. Based on the locations of
the facial features, the identification module searches the database for the
identity of the unknown face using matching pursuit filters to make the
identification.

The  algorithm  was  demonstrated  on  three  sets  of  images.  The  first  set  was 
images  from  the  FERET  database.  The  second  set  was  infrared  and  visible 
images  of  the  same  people.  (These  two  sets  allowed  the  examination  of 
algorithm  performance  on  infrared  and  visible  images  individually,  and  on 
fused  data  from  both  modalities.)  The  third  set  of  images  was  mugshot  data 
from  a  law  enforcement  application. 
1. Introduction


There are many applications in modern society for a successful face
identification system: nonintrusive identification and verification for credit
cards and ATMs, nonintrusive access control to buildings and restricted
areas, and monitoring of ports of entry for terrorists and smugglers. For
the designer of pattern recognition algorithms, face recognition is a very
challenging problem. The goal is to develop an algorithm that can
differentiate among a population of three-dimensional curved objects that all
have the same basic shape, from databases whose sizes vary from a couple
of hundred individuals to over one million. Further, the face itself is
a dynamically varying object: facial expressions, makeup, facial hair, and
hairstyle all change over time. The conditions under which facial imagery
is collected also contribute to the difficulty of developing face recognition
algorithms. The lighting, background, pose of the face, scale, and
parameters of the acquisition are all variables in facial imagery collected
under real-world scenarios. A key to successfully developing a general face
identification system is to systematically solve a sequence of subproblems of
increasing complexity. One critical subproblem is the development of an
algorithm that can identify faces from a database of frontal facial imagery.

In this report, I describe an algorithm to perform the above task that is
based on matching pursuit filters, a small set of facial features, and a
simple geometric model of the face. The set of features consists of the nose
and eye regions of the face and the interior of the face at a reduced scale.
Since the nose and eye regions are the most stable and least varying parts
of the face, restricting attention to these features increases the robustness
of the algorithm with respect to variations in facial expressions, hair style,
and background. The interior of the face is included so that information is
encoded on the overall shape of the face. The geometric model describes
the spatial relationship between the facial features: e.g., the eyes are above
the nose, and the left eye is to the left of the right eye. The knowledge of
the spatial relationship guides the search for facial features and ensures a
realistic arrangement of features during identification.

Face recognition is substantially different from classical pattern
recognition problems, such as character recognition. In character recognition
there are a limited number of classes (usually fewer than 50), with a large
number of training examples in each class, whereas in face recognition, there
are a large number of faces, or classes, and only a few training examples
per face. In our case, there is only one training example per person, and
the size of the gallery exceeds 300 individuals. (A gallery is a collection
of images of known individuals; an image of an unknown face presented
to the algorithm is called a probe.) Because of the size of the gallery, it is
neither practical nor desirable to handcraft a representation that
characterizes faces. Therefore, to be able to identify faces from large
galleries, we need a method to automatically find features that distinguish
one face from another.

The neural network community is pursuing techniques that automatically
select features that distinguish among classes of objects. A few relevant
techniques are feed-forward networks [1], principal component analysis
[2], projection pursuit [3-5], factorial analysis [6,7], dynamic link
architectures [8], and entropy-based wavelet encoding of images [9]. These
techniques either originated in the neural network community (feed-forward
networks, factorial analysis, and dynamic link architecture), or have found
applications within neural networks (principal component analysis and
projection pursuit). The work of a number of these authors [5-9] is
motivated by theories of the human visual system. The goal of these
approaches is to find representations that are data driven.

The heart of our face recognition algorithm is a new tool for creating
efficient and compact models, called the matching pursuit filter technique,
which is an adaptive wavelet expansion. A wavelet expansion of an
image is adaptive if the choice of the wavelet basis depends on the image(s).
The main innovation of a matching pursuit filter is that it is a wavelet
expansion that is both data- and problem-adaptive; i.e., the expansion is
adapted to both the data and the pattern recognition problem being
addressed. This contrasts with most adaptive schemes, where the
representation is a function of the data, but not a function of the problem to
be solved. In a problem-driven expansion such as a matching pursuit filter, a
filter that detects faces is designed to encode the similarities of all faces in
the training set, whereas the filter that identifies faces automatically encodes
the differences among the faces in the training set.

In image compression and signal analysis, adaptive wavelets have been
used to decompose an individual image or signal [9-12]. In these works
an algorithm selects the wavelets by minimizing a cost function: the error
between the reconstructed image and the original image. Because a single
image or signal is being decomposed, it is possible to find the optimal basis.
The matching pursuit technique of Mallat and Zhang [11] uses a greedy
algorithm to decompose an individual one-dimensional signal. In this work
I generalize Mallat and Zhang's matching pursuit algorithm to
simultaneously decompose multiple images for application to pattern
recognition. In the applications presented here, the matching pursuit filter
design algorithm simultaneously decomposes all the images in the training
set; thus it is not computationally feasible to optimize a global cost function.
Instead, the algorithm selects the basis elements by using a greedy algorithm.
In each iteration of the greedy algorithm, the algorithm chooses the next
wavelet in the expansion by minimizing a cost function. For matching
pursuit filters, the cost function incorporates the decision-making criterion of
the associated pattern recognition problem.

The use of matching pursuit filters is a nonparametric technique for
finding the differences among faces. Two related parametric techniques, which
are cases of projection pursuit, are principal component analysis and Fisher
discriminant analysis [13]. (Principal component analysis has been applied
to face recognition [14-16] and to object recognition and detection [17,18];
discriminant analysis has been applied to face recognition [19-21].) Fisher
discriminant analysis gives the optimal linear discriminant among classes
(faces) when the distribution of each class is Gaussian. When there is only
one example per class, discriminant analysis reduces to a variant of
principal component analysis, which produces the optimal linear compression
for least-squares error.

With one example per face, the assumption behind principal component
analysis is that compression correlates with the differences among faces. In
contrast, matching pursuit filters explicitly find these differences. (In this
report, I concentrate on the case where there is one example per face.) For
multiple examples per face, the matching pursuit expansion cost function
could be modified to incorporate such information. Theoretically,
differences in performance are determined by how well the distribution of the
images of the face is modeled by a Gaussian distribution. The closer to a
Gaussian distribution, the better the performance of Fisher discriminant
analysis. For one example per face, I compare the performance of matching
pursuit filters and principal component analysis in two experiments (sect.
4.3 and 4.4).

To exploit the spatial structure in images, matching pursuit filters explicitly
model the two-dimensional structure of objects: the images are treated as
functions in two variables, and the wavelet basis is constructed from
two-dimensional directional filters. Because wavelets are local filters, one can
directly see how the model relates to the training examples. As a result, the
spatial arrangement of the wavelet basis has a tangible meaning in
relation to objects in the training set. Figure 1 shows the reconstructions of
four faces from an identification filter. By examining the reconstructions of
the faces, one sees how the filter encodes facial differences. The
reconstructions are not faithful to the original images, because the filter
design algorithm selects a basis that differentiates among faces, not a basis
that minimizes the reconstruction error. In matching pursuit filters, the
explicit representation of the image by two-dimensional wavelets captures
both local and global features. This contrasts with discriminant analysis and
principal component analysis, where images are treated as vectors, and the
representation is global.

Figure 1. Reconstructions of faces from an identification filter. Images
are reconstructed with equation (5), sect. 2.1, with 30 coefficients.

In computer vision, artificial intelligence, and neuroscience, approaches to
finding representations of objects from data are active areas of research. The
dynamic link architecture of Lades et al. [8] is a general object recognition
technique that represents objects by projecting an image onto a
rectangular array of Gabor jets. Wiskott et al. [22] have specialized the
architecture for face recognition. The work of Rao and Ballard on iconic
representation [23,24] takes a similar approach, except that images are
projected onto a jet of directional derivatives of Gaussian densities. In
dynamic link architecture and iconic representation, the basis (filters) that
represents an object is chosen by the algorithm designer. In the work of
Viola [25], a basis is iteratively constructed from subwindows in a set of
training images. The selection criterion for basis vectors is their
orthogonality to basis vectors already chosen, not their power to separate
classes.

Two measures of success for a face recognition algorithm are the ability to
recognize faces from a large gallery and the ability to automatically process
probes. In the algorithm presented here, images are automatically
processed with a two-stage system. The first stage locates the face and a small
set of facial features. The second stage identifies the face given the location
from the first stage. In this report, I demonstrate algorithm performance by
identifying faces from two large galleries (sect. 4.2 and 4.4). The first is a
gallery of 311 individuals, with one image per person; the images are from
the FERET database [26]. The second is a gallery of 2175 individuals, with
images from mugshot data.

This approach contrasts with the majority of algorithms described in the
literature, for which results are given on small galleries (<50 individuals)
with many images per person in the gallery (>5 images). Since these
algorithms also require that the face be in a predetermined position, the first
stage is not needed.

Only a handful of algorithms have been tested on galleries of more than 150
individuals and have the ability to process images automatically [22,27-30].
Swets and Weng [31] tested their algorithm on FERET and other images,
but their algorithm is not fully automatic. Cox et al. [32] identified faces
from a database of 685 images, but required that an operator manually
locate 35 points on each face.

One  area  of  recent  interest  is  recognizing  faces  in  infrared  (IR)  imagery. 
This  interest  is  driven  by  the  ability  of  IR  cameras  to  acquire  images  of 
faces  in  the  dark.  This  report  also  presents  results  from  a  study  comparing 
algorithm  performance  on  visible  versus  IR  images  for  face  recognition. 




2.  Matching  Pursuit  Filters 


The original matching pursuit idea of Mallat and Zhang [11] uses a greedy
heuristic to iteratively construct a best-adapted decomposition of a
function $f$ on $\mathbb{R}$. The algorithm works by choosing at each iteration
$i$ the wavelet $g$ in the dictionary $\mathcal{D}$ that has the maximal
projection onto the residue of $f$. The best-adapted decomposition is selected
by the following greedy strategy. Let $R^0 f = f$; then $g_i$ is chosen such that

$$|\langle R^i f, g_i \rangle| = \max_{g \in \mathcal{D}} |\langle R^i f, g \rangle|, \tag{1}$$

where

$$R^{i+1} f = R^i f - \langle R^i f, g_i \rangle g_i$$

for $i \ge 0$.

The algorithm selects each wavelet in the expansion by maximizing the
right-hand term in equation (1). This equation allows for an expansion
based on a single function and minimizes the reconstruction error. To
extend the technique to pattern recognition, we replace the right-hand side
with a function $C_g$, which (1) allows for the simultaneous expansion of
multiple templates (functions), and (2) incorporates knowledge of the
pattern recognition problem being addressed. The extension from functions
$f$ on $\mathbb{R}$ to functions (templates) $t$ on $\mathbb{R}^2$ is
straightforward: a dictionary of two-dimensional wavelets is used.
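The greedy strategy of equation (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this report: the dictionary is a hypothetical set of random unit-norm vectors standing in for wavelets, the signal is one-dimensional, and the names `matching_pursuit`, `atoms`, and `signal` are chosen only for this example.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(f, dictionary, n_iter):
    """Greedy matching pursuit; dictionary is a list of unit-norm atoms."""
    residue = list(f)                      # R^0 f = f
    picks, coeffs = [], []
    for _ in range(n_iter):
        # Equation (1): pick the atom with the largest |<R^i f, g>|.
        projections = [dot(residue, g) for g in dictionary]
        best = max(range(len(dictionary)), key=lambda j: abs(projections[j]))
        c = projections[best]
        # R^{i+1} f = R^i f - <R^i f, g_i> g_i
        residue = [r - c * g for r, g in zip(residue, dictionary[best])]
        picks.append(best)
        coeffs.append(c)
    return picks, coeffs, residue

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Toy dictionary of random unit-norm atoms (an assumed stand-in for wavelets).
rng = random.Random(0)
atoms = [unit([rng.gauss(0, 1) for _ in range(32)]) for _ in range(64)]
signal = [2.0 * a - 1.5 * b for a, b in zip(atoms[3], atoms[17])]
picks, coeffs, residue = matching_pursuit(signal, atoms, n_iter=10)
```

Because each step removes the component of the residue along a unit-norm atom, the selected terms plus the final residue always add back up to the original signal.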

2.1  Matching  Pursuit  Filters  for  Detection 

Matching pursuit filters have two components. The first component is how
the face or a facial feature is represented (for example, all noses). For
matching pursuit filters, all noses are represented by a given basis. The
second component is the representation of a particular nose. I first discuss
the encoding of an instance of a nose for a given basis, and then show how the
basis is selected.

A particular nose (or in the more general case, an instance of an object) is
represented as an $n$-dimensional vector $(a_0, \ldots, a_{n-1})$, called a
coefficient vector. One computes the coefficient values $a_i$ by projecting
the image of a nose onto a basis $\{g_0, \ldots, g_{n-1}\}$, which need not be
orthogonal. Because the basis is not necessarily orthogonal, an iterative
projection algorithm calculates the coefficients. If the basis is orthogonal,
then the algorithm reduces to the standard projection method. The projection
algorithm adjusts for the nonorthogonality by using residual images. If $t$ is
an image or template, then $R^i t$ is the residual image during iteration $i$,
where $R^0 t = t$. The coefficient $a_i$ is the projection of the residual
image $R^i t$ onto the basis element $g_i$, or mathematically

$$a_i = \langle R^i t, g_i \rangle, \tag{2}$$

where $\langle \cdot, \cdot \rangle$ is the inner product between two
functions. The residual image is updated after each iteration by

$$R^i t = R^{i-1} t - a_{i-1} g_{i-1}, \tag{3}$$

for $i \ge 1$.

After the $n$th iteration, an image $t$ is decomposed into a sum of residual
images:

$$t = \sum_{i=0}^{n-1} (R^i t - R^{i+1} t) + R^n t. \tag{4}$$

Rearranging equation (3) and substituting into equation (4) yields

$$t = \sum_{i=0}^{n-1} a_i g_i + R^n t,$$

and the approximation of the original image after $n$ iterations is given by

$$\hat{t} = \sum_{i=0}^{n-1} a_i g_i. \tag{5}$$

The approximation need not be very accurate, because only enough
information is encoded to allow detection. By examining the reconstruction
$\hat{t}$, one sees which features or details distinguish noses from non-noses.
In contrast, reconstructions from adapted wavelet expansions based on an image
compression admissibility criterion will have a greater fidelity to the
original image(s).
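The iterative projection of equations (2) and (3) and the reconstruction of equation (5) can be checked numerically on a toy example. The sketch below assumes unit-norm basis elements and treats a four-pixel vector as the "image"; the basis, the template `t`, and the function names are hypothetical and serve only to illustrate the bookkeeping.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(t, basis):
    """Coefficients a_i = <R^i t, g_i> (eq. 2) with the residual update of eq. 3."""
    residual = list(t)                      # R^0 t = t
    coeffs = []
    for g in basis:
        a = dot(residual, g)
        coeffs.append(a)
        residual = [r - a * x for r, x in zip(residual, g)]
    return coeffs, residual

def reconstruct(coeffs, basis):
    """Equation (5): t_hat = sum_i a_i g_i."""
    n = len(basis[0])
    return [sum(a * g[k] for a, g in zip(coeffs, basis)) for k in range(n)]

# Two unit-norm basis elements that are NOT orthogonal to each other.
s = 2 ** -0.5
basis = [[s, -s, 0.0, 0.0], [0.0, s, -s, 0.0]]
t = [3.0, 1.0, 4.0, 1.0]

coeffs, residual = project(t, basis)
t_hat = reconstruct(coeffs, basis)
```

Adding `t_hat` and `residual` element by element recovers `t`, which is the unnumbered identity between equations (4) and (5).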

The goal of the detection algorithm is to determine whether an observed
pattern belongs to a particular class: i.e., is this a nose? For this
determination to be made, there must be a way of measuring the similarity
between two objects or patterns. With matching pursuit filters, one compares
the coefficient vectors from two objects, where the coefficient vectors are
generated by the same basis. The similarity measure between two objects is
the angle between their coefficient vectors. This measure is invariant under
linear changes in the contrast of the image. Furthermore, if the basis is
composed of wavelets, then the similarity measure is also invariant to the
illumination level in the image. An $L^2$ function $t$ is a wavelet if
$\int t(x)\,dx = 0$; this requirement is referred to as the wavelet condition
[33]. More precisely, the similarity measure is invariant to linear changes in
illumination.
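The invariance claim can be illustrated with a small numerical check. The sketch below assumes zero-sum, unit-norm basis elements (a discrete analogue of the wavelet condition), a hypothetical proto-object vector `proto`, and a four-pixel "image"; none of these values come from the experiments in this report.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def coefficients(t, basis):
    """Iterative projection: a_i = <R^i t, g_i>, then remove a_i g_i."""
    residual = list(t)
    coeffs = []
    for g in basis:
        a = dot(residual, g)
        coeffs.append(a)
        residual = [r - a * x for r, x in zip(residual, g)]
    return coeffs

# Each basis element sums to zero, so a constant offset projects to zero.
s = 2 ** -0.5
basis = [[s, -s, 0.0, 0.0], [0.0, s, -s, 0.0], [0.0, 0.0, s, -s]]
proto = [1.0, -0.5, 2.0]                  # hypothetical proto-object vector
image = [3.0, 1.0, 4.0, 1.0]
relit = [2.5 * p + 10.0 for p in image]   # contrast x2.5, illumination +10

r_original = cosine(proto, coefficients(image, basis))
r_relit = cosine(proto, coefficients(relit, basis))
```

The two responses agree to rounding error: the offset is annihilated by the zero-sum basis elements, and the positive contrast factor scales the whole coefficient vector without changing its angle.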

The  second  part  of  representing  noses  is  choosing  an  appropriate  basis. 
In  the  ideal  basis,  all  noses  would  have  the  same  coefficient  vector,  and 
all  occurrences  of  this  coefficient  vector  would  be  a  nose.  Unfortunately, 
this  does  not  occur.  An  alternative  criterion  is  to  select  a  basis  where  the 
coefficient  vectors  that  represent  noses  cluster.  The  vector  that  is  the  cluster 
center  is  referred  to  as  a  proto-nose  (or,  in  general,  a  proto-object).  This  vector, 
or  coefficient  vector,  represents  an  average  nose.  The  matching  pursuit  filter 
design  algorithm  searches  for  such  a  representation. 

The matching pursuit filter is trained on $m$ different examples of noses. Let
$\{t_1, \ldots, t_m\}$ be $m$ examples of noses, where $t_l$ contains one
example of a nose. The noses are aligned in the templates so that the center
of the nose is the origin. For objects other than noses, the examples are
aligned about a common point. Using these examples, the algorithm selects the
basis elements from a dictionary $\mathcal{D}$.

In the work described here, a dictionary is composed of two-dimensional
directional wavelets. These wavelets were chosen because they encode
information locally at different scales and orientations. The basis elements
in the dictionary do not span the space of possible images. The dictionary
excludes high-frequency wavelets to reduce the effect of high-frequency
noise. It also excludes low-frequency wavelets, for computational
considerations and to avoid encoding information in the background. The
wavelets in these dictionaries can be centered at any place in the region
containing a nose in the training set (fig. 2).

Figure 2. A matching pursuit filter scanning an image. Center of filter is O,
which moves as image is scanned. This filter has five basis elements, $g_0$,
$g_1$, $g_2$, $g_3$, and $g_4$. Centers of wavelets $g_i$ relative to O are
marked by "+" signs.

Figure 3. Pseudo-code for basis selection algorithm.

For face recognition, the algorithm uses a dictionary derived from the
second partial derivatives of Gaussian densities and their Hilbert transforms,
which were selected because they are directional edge detectors. The
wavelets do not need to be self-inverting because I am not interested in
reconstructing the images. (I do not address the issue of what an "optimal"
dictionary is for a particular problem. This difficult problem is beyond the
scope of this report.)
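As an illustration of one such dictionary element, the sketch below samples the second directional derivative of an isotropic two-dimensional Gaussian on a small grid. It is a hedged example only: the grid size, `sigma`, the mean subtraction used to enforce a zero sum on the truncated grid, and the unit-norm scaling are all assumptions of this sketch, and the Hilbert-transform partner is not constructed.

```python
import math

def gaussian_second_derivative(half_width, sigma, theta):
    """Sampled second derivative of an isotropic 2-D Gaussian along direction theta.

    Returns a square kernel (list of rows) of side 2*half_width + 1, shifted to
    zero sum (discrete wavelet condition) and scaled to unit L2 norm.
    """
    c, s = math.cos(theta), math.sin(theta)
    coords = range(-half_width, half_width + 1)
    kernel = []
    for y in coords:
        row = []
        for x in coords:
            u = x * c + y * s  # coordinate along the derivative direction
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
            row.append((u * u / sigma ** 4 - 1.0 / sigma ** 2) * envelope)
        kernel.append(row)
    count = len(kernel) ** 2
    mean = sum(map(sum, kernel)) / count
    kernel = [[v - mean for v in row] for row in kernel]
    norm = math.sqrt(sum(v * v for row in kernel for v in row))
    return [[v / norm for v in row] for row in kernel]

horizontal = gaussian_second_derivative(half_width=6, sigma=2.0, theta=0.0)
vertical = gaussian_second_derivative(half_width=6, sigma=2.0, theta=math.pi / 2)
```

Varying `sigma` and `theta` gives elements at different scales and orientations; here the `theta = pi/2` kernel is simply the transpose of the `theta = 0` kernel.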

A greedy algorithm selects the basis elements. In iteration $i$, the basis
function $g_i$ is selected. The choice of $g_i$ is a function of the residual
images $R^i t_l$ and coefficients $a^l_j$ from previous iterations, i.e.,
$j < i$. Let the coefficient $a^l_j = \langle R^j t_l, g_j \rangle$, that is,
the $j$th coefficient for template $l$. The set of coefficients generated
through the $i$th iteration is denoted by
$A_i = \bigcup_l (a^l_0, \ldots, a^l_i)$, $i \ge 0$, and $A_{-1} = \emptyset$.

Each iteration of the basis selection algorithm consists of three steps.
(Pseudo-code for the basis selection algorithm is given in fig. 3.) In the
first step, the basis function $g_i$ is selected. In the second step, the
coefficient vectors for each template $t_l$ are updated. In the third step,
the residual images are updated by $R^{i+1} t_l = R^i t_l - a^l_i g_i$. The
$i$th basis function is selected by the following optimization procedure:

$$g_i = \arg\min_{g \in \mathcal{D}} C_g(R^i t_1, \ldots, R^i t_m, A_{i-1}),$$

where $C_g$ measures how well the coefficient vectors cluster when the $i$th
basis function is $g$. The function $C_g$ is evaluated for each
$g \in \mathcal{D}$, and the $g$ that minimizes $C_g$ is selected as the basis
element $g_i$. (Pseudo-code for evaluating $C_g$ is presented in fig. 4.) In
the current implementation of $C_g$ for a given $g$, the cluster is the mean of
$(a^l_0, \ldots, a^l_{i-1}, \langle R^i t_l, g \rangle)$, $1 \le l \le m$.
Once the cluster vector is determined, $C_g$ computes the average distance
from the coefficient vectors to the cluster vector. This distance is a measure
of scatter (variance) of the coefficient vectors about the cluster vector. If
the dispersion is small, then $g$ is a good candidate for $g_i$; on the other
hand, if the dispersion is large, then $g$ is a poor choice. The technique
extends to $k$ proto-noses or cluster vectors. In this case, $C_g$ generates
$k$ clusters and measures the spread of the coefficient vectors from the
cluster vectors. The clusters are found with a $k$-means algorithm. Figure 5
illustrates the design of a detection filter. Figures 5(a) to (d) are the
training set, and 5(e) is the nose in 5(b), reconstructed by equation (5) with
30 coefficients.

Basis selection algorithm (fig. 3):

1: $R^0 t_l = t_l$;
2: do $i = 0$ to $i <$ number of iterations, $n$;
   \* number of iterations = number of desired basis elements. *\
3:   Compute $C_g$ for each wavelet in the dictionary;
4:   Select the $i$th basis element,
       $g_i$ is the wavelet that minimizes $C_g$;
5:   Update coefficients for each template,
       $a^l_i = \langle R^i t_l, g_i \rangle$;
6:   Update the residue images for each template,
       $R^{i+1} t_l = R^i t_l - a^l_i g_i$;
7:   Increment the iteration counter,
       $i = i + 1$;
8: end do

Figure 4. Pseudo-code for evaluating $C_g$ during design of detection filters.

$g$ is the wavelet for which $C_g$ is to be evaluated.
Iteration $i$.
$(a^l_0, \ldots, a^l_{i-1})$ are the coefficients computed through iteration
$i - 1$ to represent template $t_l$.

1: Compute the centroid of $(a^l_0, \ldots, a^l_{i-1}, \langle R^i t_l, g \rangle)$;
   Let $\mu$ be the centroid.
2: Compute the mean distance $U$ between $\mu$ and
   $(a^l_0, \ldots, a^l_{i-1}, \langle R^i t_l, g \rangle)$;
3: Return $U$ as the value of $C_g$;

The  algorithm  is  iterated  until  n  basis  elements  are  selected.  The  choice  of 
the  number  of  basis  elements  depends  on  the  performance  level  desired 
and  is  usually  determined  experimentally.  If  n  is  too  small,  then  the  false- 
alarm  rate  is  too  high;  if  n  is  too  large,  the  filter  will  not  generalize  to  noses 
outside  the  training  set. 
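The design loop for a single proto-object can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation used for the experiments: templates and wavelets are short vectors, the "mean distance" is taken to be Euclidean, a wavelet that has already been chosen is not offered again, and the names `scatter_cost` and `design_detection_filter` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def scatter_cost(coeff_vectors, residuals, g):
    """C_g as in fig. 4: mean distance of the candidate vectors to their centroid."""
    candidates = [a + [dot(r, g)] for a, r in zip(coeff_vectors, residuals)]
    m = len(candidates)
    centroid = [sum(col) / m for col in zip(*candidates)]
    return sum(math.dist(v, centroid) for v in candidates) / m

def design_detection_filter(templates, dictionary, n):
    """Greedy basis selection as in fig. 3; returns the basis and the proto-object."""
    residuals = [list(t) for t in templates]          # R^0 t_l = t_l
    coeff_vectors = [[] for _ in templates]
    basis = []
    for _ in range(n):
        # Assumption of this sketch: skip wavelets that were already selected.
        unused = [w for w in dictionary if not any(w is b for b in basis)]
        g = min(unused, key=lambda w: scatter_cost(coeff_vectors, residuals, w))
        basis.append(g)
        for l, r in enumerate(residuals):
            a = dot(r, g)                             # a_i^l = <R^i t_l, g_i>
            coeff_vectors[l].append(a)
            residuals[l] = [x - a * y for x, y in zip(r, g)]
    m = len(templates)
    proto = [sum(col) / m for col in zip(*coeff_vectors)]  # cluster centre
    return basis, proto

s = 2 ** -0.5
dictionary = [[s, -s, 0.0, 0.0], [0.0, s, -s, 0.0], [0.0, 0.0, s, -s],
              [0.5, 0.5, -0.5, -0.5]]
templates = [[5.0, 1.0, 3.0, 0.0], [6.0, 2.0, 1.0, 1.0], [4.0, 0.0, 4.0, 2.0]]
basis, proto = design_detection_filter(templates, dictionary, n=2)
```

In this toy run the first element chosen is the one on which all three templates respond identically (the difference between their first two samples), so the scatter after the first iteration is essentially zero.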

The output from the matching pursuit filter design algorithm is an ordered
list of $n$ basis elements and a list of $n$ coefficients. The combination of
both lists is a matching pursuit filter. If the filter design algorithm
generates $k$ proto-noses, the matching pursuit filter consists of the basis
elements and the $k$ coefficient lists (coefficient vectors). The location of
the basis elements encodes the geometric structure of an object (fig. 2). The
centers of the basis elements $g_i$ are usually not aligned. This is
illustrated in figure 1, where a filter represents an object (nose) that is
larger than the support of an individual basis element.

A matching pursuit filter detects a nose by scanning a nose detection
filter across an image (fig. 2), which results in a response image
$\mathcal{T}$. The response at pixel $(u_1, u_2)$ measures the similarity
between the region centered at $(u_1, u_2)$ and the proto-nose. One criterion
detects the center of the nose at the maximum response in $\mathcal{T}$. An
alternative method reports all points above a threshold as the center of the
nose. The algorithm computes the response at a pixel $(u_1, u_2)$ by comparing
the proto-nose coefficient vector with an image coefficient vector
$a(u_1, u_2)$. There is an image coefficient vector $a(u_1, u_2)$ for each
pixel. (Fig. 6 is a pseudo-code description of the filter detection algorithm.)
Figure 5. Design of detection filters: (a) to (d) Training set. (e) Nose in (b)
reconstructed by equation (5).
Figure 6. Pseudo-code for scanning an image with a detection filter.

$\mathcal{I}(u_1, u_2)$ : input image.
$A = (a_0, \ldots, a_{n-1})$ : coefficient vector that represents proto-nose.
$\mathcal{T}(u_1, u_2)$ : response image.

1: do for all pixels $(u_1, u_2)$ in image $\mathcal{I}$;
2:   Compute image coefficient vector $a(u_1, u_2)$ using equation (6),
       $a(u_1, u_2) = (a_0(u_1, u_2), \ldots, a_{n-1}(u_1, u_2))$;
3:   Compute filter response at pixel $(u_1, u_2)$,
       $\mathcal{T}(u_1, u_2) = d_\theta(A, a(u_1, u_2))$;
4: end do;
5: Search $\mathcal{T}$ for the location of the nose(s);


The algorithm computes the image coefficient vector $a(u_1, u_2)$ by
expanding the image about the pixel $(u_1, u_2)$. This expansion is
accomplished by translation of the basis elements $g_i$ by $(u_1, u_2)$, and
projection of the image on the translated basis elements. Let
$a_i(u_1, u_2)$ be the $i$th coefficient of $a(u_1, u_2)$; then

$$a_i(u_1, u_2) = \langle R^i \mathcal{I}, g_i(\cdot + u_1, \cdot + u_2) \rangle. \tag{6}$$

After the image coefficient vectors have been determined, the next step
computes the response image. Let $A = (a_0, \ldots, a_{n-1})$ be the cluster
vector that represents the proto-nose; then
$\mathcal{T}(u_1, u_2) = d_\theta(A, a(u_1, u_2))$, where
$d_\theta(\cdot, \cdot)$ is the cosine of the angle between two vectors; i.e.,
the response is the cosine of the angle between $A$ and $a(u_1, u_2)$. The
last step searches $\mathcal{T}$ for noses. In some applications, the matching
pursuit filter consists of more than one cluster coefficient vector. In this
case, the response of the filter at each pixel is the maximum value of
$d_\theta$ taken over all cluster vectors.
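The scanning procedure can be sketched on a toy image. The example below makes several simplifying assumptions that are not part of the report: each basis element is a tiny zero-sum kernel stored with the position of its top-left sample inside a square window, the response is indexed by the window's top-left corner rather than its center, a flat window is given response 0, and all names (`image_coefficients`, `scan`, `pattern`) are hypothetical.

```python
import math

def image_coefficients(patch, basis):
    """Coefficient vector of one window; the residual is updated after each element.

    basis: list of (row, col, kernel); kernel is a list of rows whose top-left
    sample sits at (row, col) inside the window.
    """
    residual = [row[:] for row in patch]
    coeffs = []
    for r0, c0, kernel in basis:
        cells = [(i, j) for i in range(len(kernel)) for j in range(len(kernel[0]))]
        a = sum(kernel[i][j] * residual[r0 + i][c0 + j] for i, j in cells)
        for i, j in cells:
            residual[r0 + i][c0 + j] -= a * kernel[i][j]
        coeffs.append(a)
    return coeffs

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0          # flat window: treated as no response
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def scan(image, basis, proto, size):
    """Response image, indexed by the top-left corner of each size x size window."""
    rows, cols = len(image) - size + 1, len(image[0]) - size + 1
    return [[cosine(proto, image_coefficients(
                 [row[c:c + size] for row in image[r:r + size]], basis))
             for c in range(cols)] for r in range(rows)]

s = 2 ** -0.5
basis = [(0, 0, [[s, -s]]), (1, 1, [[s], [-s]]), (2, 0, [[s, -s]])]
pattern = [[9.0, 1.0, 0.0], [0.0, 8.0, 0.0], [2.0, 1.0, 5.0]]
proto = image_coefficients(pattern, basis)

# Paste a brighter, higher-contrast copy of the pattern at row 2, column 3.
image = [[7.0] * 7 for _ in range(6)]
for i in range(3):
    for j in range(3):
        image[2 + i][3 + j] = 2.0 * pattern[i][j] + 5.0

response = scan(image, basis, proto, size=3)
peak = max(((r, c) for r in range(len(response)) for c in range(len(response[0]))),
           key=lambda rc: response[rc[0]][rc[1]])
```

The maximum of the response image falls on the window that contains the pasted pattern, even though its contrast and brightness differ from the pattern used to build `proto`.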

2.2  Identification 

In detection problems such as those discussed in this work, we are interested
in locating a face or facial feature, whereas in identification we are
interested in distinguishing among faces. The filters are designed from images
in the gallery, and the filters are used to identify unknown faces in probes.
The generalization to more complex cases is straightforward; these include
problems such as character recognition, where there is more than one
training example per class. The overall strategy for designing matching
pursuit filters for identification is the same as for detection, except that a
different criterion is used to select the basis.

For detection, the matching pursuit filter design procedure selects a basis
in which the coefficient vectors cluster, and only one coefficient vector
$A$ represents a class of objects. For detection, $A$ is compared to image
coefficient vectors. Because there is a single class, only one coefficient
vector is needed. However, for identification, to distinguish among all the
people in the database, there is a coefficient vector for each individual.
Person $l$ is represented by coefficient vector
$A^l = (a^l_0, \ldots, a^l_{n-1})$. To measure the similarity between an
unknown face and individual $l$, we compare coefficient vectors
$a(u_1, u_2)$ and $A^l$.

A face centered at $(u_1, u_2)$ is identified as person $l$ if the distance between
$a(u_1, u_2)$ and $A^l$ is minimized. To decrease the likelihood that faces are
misidentified, the matching pursuit design algorithm searches for a basis
that separates the $A^l$ coefficient vectors. The algorithm for selecting the $i$th
basis element for identification has the same three steps as for detection,
but with a different function $C_g$. For identification,

$$C_g(R^{i-1}\mathcal{I}_1, \ldots, R^{i-1}\mathcal{I}_m, A_{i-1}) = -\sum_k \max_{l \neq k} d_\theta(k, l) + \lambda \sum_k \left\| \left(\alpha^k_0, \ldots, \alpha^k_{i-2}, \langle R^{i-1}\mathcal{I}_k, g \rangle\right) \right\|$$

selects the $i$th basis function. The function $d_\theta(k, l)$ equals the cosine of the
angle between $(\alpha^k_0, \ldots, \alpha^k_{i-2}, \langle R^{i-1}\mathcal{I}_k, g \rangle)$
and $(\alpha^l_0, \ldots, \alpha^l_{i-2}, \langle R^{i-1}\mathcal{I}_l, g \rangle)$. The
coefficient vector $(\alpha^k_0, \ldots, \alpha^k_{i-2})$ represents person $k$ after
the $(i-1)$th iteration. If $g$ were selected for $g_i$, then
$(\alpha^k_0, \ldots, \alpha^k_{i-2}, \langle R^{i-1}\mathcal{I}_k, g \rangle)$ would
represent person $k$ after iteration $i$. The first term in $C_g$ forces the coefficient
vectors to separate, and the second term searches for sets of coefficient
vectors with the largest average magnitude. The parameter $\lambda$ sets the relative
importance of the two terms. If the second term is not included, the filter
becomes too sensitive to patterns in the background. Displayed in figure 1
are the reconstructions of four faces from an identification filter using
equation (5).

For identification, the output from the matching pursuit filter
design algorithm is a list of $n$ basis elements and a coefficient vector for each
person in the training set. The procedure for identifying faces in images is
a variant of the method that detects noses (fig. 7). As before, $a(u_1, u_2)$ is the
image coefficient vector centered at $(u_1, u_2)$. For detection, a single response
image $\mathcal{F}$ was computed; however, for identification, a response
$\mathcal{F}_k$ is computed for each coefficient vector $A^k$, where
$\mathcal{F}_k(u_1, u_2) = d_\theta(A^k, a(u_1, u_2))$.
The estimated identity of the person in the image is $\hat{k}$, which is found by a
search for the maximum response over all the $\mathcal{F}_k$ images. More precisely,
the face in the image is identified as the person $\hat{k}$ such that

$$\mathcal{F}_{\hat{k}}(\hat{u}_1, \hat{u}_2) = \max_{k,\,(u_1, u_2)} \mathcal{F}_k(u_1, u_2),$$

where $(\hat{u}_1, \hat{u}_2)$ is the estimated center of the face in the image.
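
To make the selection criterion concrete, the following Python sketch evaluates the identification criterion for one candidate basis element. It assumes that the coefficients of all persons after the previous iteration are stacked in a matrix and that the projections of the residual images onto the candidate are already available; the function name, argument layout, and the convention that a larger value indicates a better candidate are assumptions of this sketch, inferred from the description of the two terms.

```python
import numpy as np

def identification_criterion(prev_coeffs, new_coeffs, lam):
    """Evaluate the identification criterion for one candidate basis element g.

    prev_coeffs: (m, i-1) array, coefficients of each of m persons so far.
    new_coeffs:  (m,) array, projections of each residual image onto g.
    lam:         weight of the magnitude term.
    Requires m >= 2 so that every person has at least one other person to compare with.
    """
    v = np.column_stack([prev_coeffs, new_coeffs])      # extended vectors, (m, i)
    norms = np.linalg.norm(v, axis=1)
    unit = v / np.maximum(norms, 1e-12)[:, None]
    cos = unit @ unit.T                                  # pairwise cosines
    np.fill_diagonal(cos, -np.inf)                       # exclude l == k
    separation = -cos.max(axis=1).sum()                  # first term
    magnitude = lam * norms.sum()                        # second term
    return separation + magnitude
```

Under the stated assumption, a design loop would evaluate this function for every dictionary element and keep the one with the largest value.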

The extension to distinguishing among multiple classes with more than one
example is straightforward. It is a combination of the detection and
identification modes of designing matching pursuit filters. Each class is
represented as a proto-object or cluster vector. The function $C_g$ selects a basis that
separates the cluster vectors while simultaneously forming clusters for like
objects or classes.


Figure 7. Pseudo-code for scanning an image with an identification filter.
(Goal is to estimate $\hat{k}$.)

$\mathcal{I}(u_1, u_2)$ : input image.
$A^k = (\alpha^k_0, \ldots, \alpha^k_{n-1})$ : coefficient vector that represents person $k$.
$\mathcal{F}_k(u_1, u_2)$ : response of image $\mathcal{I}$ to $A^k$.
$\hat{k}$ : estimated identity of person in the image.

1: Compute $a(u_1, u_2)$ for each pixel in $\mathcal{I}$;
2: Compute
       $\mathcal{F}_k(u_1, u_2) = d_\theta(A^k, a(u_1, u_2))$;
3: Identify the person in image $\mathcal{I}$;
       Find $\hat{k}$ and $(\hat{u}_1, \hat{u}_2)$,
       where $\mathcal{F}_{\hat{k}}(\hat{u}_1, \hat{u}_2) = \max_{k,\,(u_1, u_2)} \mathcal{F}_k(u_1, u_2)$.

The algorithm reports that the identity of the person in image $\mathcal{I}$
is $\hat{k}$ and the face is centered at $(\hat{u}_1, \hat{u}_2)$.
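
The scan in figure 7 can be sketched as follows, under the assumption that the image coefficient vectors have already been computed and stored as an (H, W, n) array and that the gallery is a (K, n) array with one row per person. The function name `identify` and the use of array indices as pixel coordinates are illustrative choices, not part of the report.

```python
import numpy as np

def identify(coeffs, gallery, eps=1e-12):
    """Return (k_hat, (u1_hat, u2_hat), best_response).

    coeffs:  (H, W, n) image coefficient vectors a(u1, u2) (assumed layout).
    gallery: (K, n) coefficient vectors, one per person.
    """
    # Normalize so that a dot product equals the cosine of the angle.
    a = coeffs / np.maximum(np.linalg.norm(coeffs, axis=-1, keepdims=True), eps)
    A = gallery / np.maximum(np.linalg.norm(gallery, axis=-1, keepdims=True), eps)
    # responses[k, u1, u2] is the response image of person k at pixel (u1, u2)
    responses = np.einsum("kn,hwn->khw", A, a)
    k_hat, u1, u2 = np.unravel_index(np.argmax(responses), responses.shape)
    return int(k_hat), (int(u1), int(u2)), float(responses[k_hat, u1, u2])
```

A single `argmax` over the stacked response images performs the joint search over identity and location in step 3.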
3. The Face Identification System


The face identification system consists of three modules. The first is an
off-line preprocessing module that designs the matching pursuit filters. The
module designs two sets of filters: one for detecting and locating facial
features and the other for identifying faces. This module also creates the initial
gallery. The second module updates the gallery by adding or deleting
individuals from the gallery. The third module is the on-line portion of the
algorithm that takes as input images of unknown faces and returns their
identity. This module consists of two stages. The first stage detects the face
in the image and locates a small set of facial features. The locations of these
facial features are fed to the second stage, which identifies the face. The
identity is determined by comparison of the facial features of the unknown
face with representations stored in the gallery. Figure 8 shows the system
organization of the modules.

The face recognition algorithm is currently designed to identify people
from full-face frontal images using a small set of facial features (fig. 9). In
the algorithm, a feature is a region of the image that contains a prominent
facial feature. (For example, the left eye feature is the region of the face that
contains the left eye.) The features are the tip and bridge of the nose, both
eyes, and the interior of the face. The interior of the face feature is
downsampled by a factor of four.


Figure 8. System organization.

Figure 9. Facial features used: A is interior of face, B is tip of nose, C and D
are left and right eyes, E is bridge of nose.


The nose and eyes were selected because they are the least variable
facial features. Using these features increases the robustness of the algorithm
with respect to variations in facial expression, facial hair, hair style, and
the image background. The interior of the face feature encodes the overall
shape and organization of the face.

The heart of the face identification system is the matching pursuit filters,
which detect the face, locate facial features, and identify the face. The
filters are used in both stages of the on-line module (the first stage detects
and locates facial features, and the second stage identifies faces). The set of
filters in the identification stage consists of five filters, one for each of the
facial features (tip and bridge of the nose, left and right eyes, and interior
of the face). There are four filters in the detection stage: tip of the nose,
left and right eyes, and interior of the face. The location of the bridge of the
nose is estimated as being midway between the eyes.

The preprocessing module designs both sets of matching pursuit filters.
The training set for designing the filters is taken from the gallery. (For
example, the nose filters are trained from noses in images in the gallery.) The
location of the feature in the gallery image is marked by a human operator.
In the current implementation, only three points are marked, and accuracy
is not critical. The points selected are the center of the eyes and the tip of
the nose. The center of the eyes marks the location of the eye features; the
tip of the nose marks the center of the interior of the face and the tip of
the nose features. The center of the bridge of the nose is the average of the
pixels marked as the center of the eyes.

In theory, the filters should be designed from all images in the gallery;
however, for computational reasons this is not practical. In practice, filters are
designed from a subset of the faces in the gallery. For the identification
filters, the remaining faces are added to the gallery by the update procedure.
The examples used to design the feature detection filters are limited to a
subset of the images in the gallery, and the filters are not modified when
the gallery is updated.

The gallery can be updated by either the deletion or addition of a person.
The system adds a person to the gallery by computing the coefficient vector
for each feature. Each coefficient vector is computed by projecting the
feature onto the appropriate wavelet expansion using equations (2) and (3).
The new coefficient vectors are then added to the gallery. The locations of
the features in new gallery images are marked by a human operator. The
system deletes people from the gallery by removing their coefficient vectors
from the gallery.
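
The update procedure can be sketched as a small data structure. The sketch below assumes that each filter is stored as an ordered list of unit-norm basis elements and that the projection follows the standard matching pursuit residual update (coefficient = inner product with the current residual, then subtract that component); whether this matches equations (2) and (3) exactly is an assumption. The names `project` and `Gallery` are hypothetical.

```python
import numpy as np

def project(patch, basis):
    """Coefficients of a feature patch on an ordered list of unit-norm
    basis elements, using a matching pursuit style residual update (assumed)."""
    residual = np.asarray(patch, dtype=float).ravel()
    coeffs = []
    for g in basis:
        g = np.asarray(g, dtype=float).ravel()
        alpha = residual @ g             # inner product with current residual
        coeffs.append(alpha)
        residual = residual - alpha * g  # remove the explained component
    return np.array(coeffs)

class Gallery:
    """Stores one coefficient vector per feature for each person."""

    def __init__(self, filters):
        self.filters = filters           # feature name -> list of basis elements
        self.people = {}                 # person id -> {feature name: coefficients}

    def add(self, person_id, patches):
        # The filters are fixed; only new coefficient vectors are computed.
        self.people[person_id] = {
            name: project(patches[name], basis)
            for name, basis in self.filters.items()
        }

    def delete(self, person_id):
        self.people.pop(person_id, None)
```

Neither operation touches the basis elements, which reflects the point made above that the filters are not redesigned when the gallery changes.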

3.1  Feature  Location 

The feature location stage estimates the locations of the eyes, tip and bridge
of nose, and center of the interior of the face. The first feature located is the
interior of the face. In this stage, the module finds the face in the image
by running the interior face filter on a decimated version of the probe
image. The next set of features located is the eyes and the tip of the nose. The
search for these facial features is guided by a priori knowledge of the
geometry of the face and the estimated location of the center of the face: e.g.,
the right eye is to the right and above the center of the face. For locating
these features, the full-size probe image is used. The estimated center of
the bridge of the nose is midway between the estimated centers of the eyes.
The locations are then passed to the identification stage. To avoid
introducing a single point of failure in the algorithm, the feature-location stage
passes multiple locations.

The first step in the feature location stage searches for the most likely
locations of the face by running the interior face filter over the level 0 image (the
coarsest level image). The top N responses to the filter are reported as the
most likely locations of the face. To avoid the situation where the
hypothesized locations are neighbors, the algorithm uses the side condition that
the locations must be a certain distance apart. Figure 10 contains
pseudo-code for the face detection and feature location algorithm with N = 1. The
value of N selected is a trade-off between the performance and speed of
the algorithm. As N increases, the probability of correctly locating the face
increases, as does the time to identify a face in a probe image. In the current
implementation, N = 3.
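
One simple way to realize the top-N search with a minimum-separation side condition is a greedy scan over the response image, sketched below. The report does not specify the distance measure or the selection strategy, so the Euclidean distance and the greedy order used here are assumptions, as is the function name.

```python
import numpy as np

def top_n_locations(response, n, min_dist):
    """Pick up to n peaks of a 2-D response image, best first, such that
    every pair of picked pixels is at least min_dist apart (Euclidean)."""
    order = np.argsort(response, axis=None)[::-1]   # flat indices, best first
    picked = []
    for flat in order:
        u1, u2 = np.unravel_index(flat, response.shape)
        far_enough = all(
            np.hypot(u1 - p1, u2 - p2) >= min_dist for p1, p2 in picked
        )
        if far_enough:
            picked.append((int(u1), int(u2)))
            if len(picked) == n:
                break
    return picked
```

Larger values of `n` return more hypotheses for the later stages to test, which is the performance and speed trade-off discussed above.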

The system uses the level 1 image to locate the remaining features by
running the appropriate filter over a small region of the image. The feature is
located at the maximum response of the filter in that region. The location
and size of the region are based on the hypothesized location of the face, the
estimated error margins of the coarse-level processing, and the geometry of
the face. For example, say the algorithm is searching for the left eye after
the coarse level reports a detection. The coarse-level detector reports a pixel
$(u_1, u_2)$ that is the estimated location of the center of the face, which
corresponds to the tip of the nose. From the gallery, it is known that the average
translation from the tip of the nose to the center of the left eye is $(t_1, t_2)$
and that a good error region is a $p_1$ by $p_2$ pixel box. If the error estimate for
the coarse-level processing is an $r_1$ by $r_2$ pixel box, the left eye region is a
$(p_1 + r_1)$ by $(p_2 + r_2)$ pixel box centered at $(u_1 + t_1, u_2 + t_2)$. The
feature location algorithm reports the top N locations for each feature along
with an error box for that feature.
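
The arithmetic for the search region is small enough to write out. The sketch below returns the bounds of the box for one feature; the function name and the tuple layout of the return value are illustrative, and clipping to the image borders and rounding to integer pixels are left out.

```python
def search_box(center, translation, feature_box, coarse_error):
    """Bounds of the search region for one feature.

    center:       (u1, u2), estimated face center from the coarse level.
    translation:  (t1, t2), average offset from nose tip to the feature.
    feature_box:  (p1, p2), error region for the feature.
    coarse_error: (r1, r2), error estimate of the coarse-level detection.
    Returns (lo1, hi1, lo2, hi2): a (p1 + r1) by (p2 + r2) box
    centered at (u1 + t1, u2 + t2).
    """
    u1, u2 = center
    t1, t2 = translation
    p1, p2 = feature_box
    r1, r2 = coarse_error
    c1, c2 = u1 + t1, u2 + t2
    half1, half2 = (p1 + r1) / 2, (p2 + r2) / 2
    return (c1 - half1, c1 + half1, c2 - half2, c2 + half2)
```

For example, a face center at (50, 60), a translation of (10, -15), an 8 by 6 feature box, and a 4 by 2 coarse error give a 12 by 8 box centered at (60, 45).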

$\mathcal{I}_0$ : level 0 image (decimated);
$\mathcal{I}_1$ : level 1 image (original);
$(t_1, t_2)$ : average translation between the nose and left eye;
$(-t_1, t_2)$ : average translation between the nose and right eye;
$R$ : $p_1 \times p_2$ box;

1: Search $\mathcal{I}_0$ image for the center of the face,
       use face detection matching pursuit filter,
       report that the center of face is pixel $(u_1, u_2)$;
2: Search box $R$ centered at $(u_1, u_2)$ in $\mathcal{I}_1$ for the tip of the nose,
       use the nose detection filter;
3: Search box $R$ centered at $(u_1 + t_1, u_2 + t_2)$ in $\mathcal{I}_1$ for the left eye,
       use the left eye detection filter;
4: Search box $R$ centered at $(u_1 - t_1, u_2 + t_2)$ in $\mathcal{I}_1$ for the right eye,
       use the right eye detection filter;
5: Report the bridge of nose as being midway between the eyes;

Figure 10. Pseudo-code for face detection and feature location algorithm.
3.2 Identification of Individuals


The second stage of the on-line module identifies the face in the probe by
comparing the coefficient vectors generated from the probe with the
coefficient vectors stored in the gallery. Each face in the gallery is represented by
five coefficient vectors, one for each feature. The system identifies a face in
a probe by comparing coefficient vectors from the probe with the coefficient
vectors of all the faces in the gallery. All faces in the gallery are compared
with the probe, and a similarity score is produced for each comparison. The
probe is identified as the face from the gallery with the highest similarity
score.

The input to the identification stage is a set of hypothesized locations for
each feature. The hypothesized location for feature $j$ is region $R_j$,
$1 \le j \le 5$ (one for each feature). If more than one set of feature locations
is reported, then the algorithm is repeated for each of the feature sets, with
the best match taken as the answer. The algorithm proceeds by computing the
expansion for each pixel in region $R_j$; the resulting sets of coefficient vectors
will be denoted by $a^j(u_1, u_2)$.

The gallery consists of M individuals, and each individual is represented
by N coefficient vectors. Let $A^{ij}$ denote coefficient vector $j$ of person $i$. The
next step in the identification module is to find the best match between
feature $j$ of person $i$ and the estimated locations of feature $j$ in the probe
image. This is done for each feature of each person. Let $d_{ij}$ denote the score
of the best match between feature $j$ of person $i$ and the region $R_j$. The score
is computed as

$$d_{ij} = \max_{(u_1, u_2) \in R_j} d_\theta\left(A^{ij}, a^j(u_1, u_2)\right).$$

For each person in the gallery, a total score is computed that represents the
degree of similarity with the face in the probe image. The total score is a
weighted sum of the best matches for the features of person $i$. Let

$$d_i = \sum_j w_j d_{ij}$$

be the score for the match between the unknown face and person $i$ in the
gallery, where $w_j$ is the weight given to feature $j$. In this implementation,
the weight for the interior of the face is 0.5, and the weights for the
remaining features are 1.0. The unknown face is identified as the person with the
maximum $d_i$. The system detects faces that are not in the gallery by
setting a threshold $\delta$. If the score of the best match is below $\delta$, then the
face is reported as not being in the gallery.
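
The scoring and rejection rule can be sketched in a few lines. The sketch assumes the per-feature best-match scores have already been collected in an M by J array; the function name, the ordering of the features, and the use of `None` to signal a face that is not in the gallery are choices made for illustration.

```python
import numpy as np

def identify_from_scores(d, weights, threshold):
    """Weighted-sum identification with rejection.

    d:         (M, J) array of best-match scores, one row per gallery person.
    weights:   (J,) array of feature weights.
    threshold: minimum total score for the best match to be accepted.
    Returns (best_person_index_or_None, totals).
    """
    totals = d @ weights                 # total score for each person
    best = int(np.argmax(totals))
    if totals[best] < threshold:
        return None, totals              # face reported as not in the gallery
    return best, totals
```

With the weights quoted above (0.5 for the interior of the face, 1.0 for the other four features), the largest possible total is 4.5, because each cosine score is at most 1.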
Figure 11 contains pseudo-code for the identification algorithm presented
in this section. For clarity, the pseudo-code uses only two features, the face
and the nose.

Figure 11. Pseudo-code for identification algorithm.

$R_{face}$ : region hypothesized to contain the face;
$R_{nose}$ : region hypothesized to contain the nose;
$A^k_{face}$ : face coefficient vector for person $k$;
$A^k_{nose}$ : nose coefficient vector for person $k$;

1: do $k = 1$ to $K$ : number of images in database;
2:     Search $R_{face}$ for maximum response to $A^k_{face}$; (figure 6)
       Let $S^k_{face}$ be the maximum response;
3:     Search $R_{nose}$ for maximum response to $A^k_{nose}$;
       Let $S^k_{nose}$ be the maximum response;
4:     Let $S_k = S^k_{face} + S^k_{nose}$,
       where $S_k$ is the total score for person $k$;
5: end do;
6: Identify the unknown face as the person with the highest total score;
4. Experiments


To demonstrate the ability of this algorithm to recognize faces, I conducted
four sets of experiments. The first decomposes and reconstructs a single
nose; the second identifies faces from a large database; the third compares
the performance of face recognition on infrared and visible imagery; and
the fourth identifies faces from a database of mugshots from a law
enforcement agency.

Matching pursuit filters were constructed from dictionaries of wavelets.
The primary dictionary was composed of separable steerable filters [34],
namely, the second partial derivatives of the Gaussian density and their
Hilbert transforms. These filters were chosen because they are
computationally less expensive than nonseparable filters. The primary dictionary
was used in all four experiments. In the first experiment I also used a
dictionary of Gabor wavelets (for computational ease, only the odd and even
phase Gabor wavelets were used). Neither dictionary spanned the space
of all possible images. Both dictionaries contained wavelets at four scales
uniformly sampled in angular domain.

4.1  Decomposition  of  One  Instance 

The purpose of matching pursuit filters is to produce a representation
that is tuned to a particular pattern recognition problem. However, if the
training set consists of a single image, the filter design algorithm finds a
compression-based representation. The first experiment demonstrated this
by decomposing and reconstructing a single nose (fig. 12). Two different
decompositions were computed from the two different dictionaries (the
steerable filters and Gabor wavelets). Figure 12(a) is the original nose;
figure 12(b) is the nose reconstructed from the first 75 terms of the separable
steerable filter dictionary; and figure 12(c) is the nose reconstructed from
the first 75 terms of a Gabor wavelet dictionary. This reduction of one
example to a compression algorithm is what one would expect from the
original work of Mallat and Zhang [11].

Figure 12. Decomposition of one image. (a) Original image.
(b) Reconstruction using steerable filters. (c) Reconstruction using
Gabor filters.

4.2 Face Recognition from FERET Images

The second experiment was the main experiment, where the face
recognition algorithm was run on a gallery of 311 individuals with one image per
person. The images were from the FERET database of facial images [35].
Figure 13 shows images from the FERET database used in the experiment.
The fa images were in the gallery, and the fb images were probes. The
training set for the identification filters consisted of images of 58 individuals,
one image per person. Matching pursuit filters were constructed for each
of the five facial features, and each filter had 30 coefficients (the same
number was used in all identification experiments). The algorithm computes
the coefficient vectors for the remaining 253 people using the expansion from
the identification filters. This procedure demonstrates that the filters
generalize and that they do not need to be redesigned when a new person is
added to the database. This is a critical concern for real-world applications.

The face recognition algorithm was run in two modes. The first mode tested
the identification portion of the face recognition algorithm by providing
the eye and nose coordinates to the algorithm. The second mode tested the
complete face recognition algorithm by having the algorithm locate and
identify the face. For both runs of the algorithm, the percentage of faces
correctly identified in the top 1, 2, 3, and 4 matches is reported (table 1).
The performance of the algorithm in both modes is virtually identical,
indicating that in terms of system performance, the feature location algorithm
is working nearly perfectly.
Figure 13. Face images from FERET database; fa images were placed in
gallery and fb images were probes.

Table 1. Percentage of correctly identified faces for a database of 311
individuals.

              Correctly identified faces (%)
Top match     Identification only     Location & identification
1             95.4                    95.2
2             97.4                    97.4
3             98.1                    98.1
4             98.4                    98.4

In the current implementation of the system, there are five features: the
eyes, tip and bridge of the nose, and interior of the face. If a face is occluded
or the hair style is changed, the number of usable features can decrease. To
measure the effect of such changes on performance, I ran the algorithm
using progressively fewer features. The first feature to be removed was the
interior of the face, because if one of the other features were corrupted,
the interior of the face would also be corrupted. To isolate this effect from
errors in locating facial features, I located the features in the probe set
manually. Table 2 reports the results of this experiment and the combination of
features tested.

4.3  Face  Recognition  from  Infrared  and  Visible  Imagery 

An area of growing interest is identifying faces from different modalities,
in particular from infrared imagery. (For example, Wilder et al. [36] were
among the first to address this issue in a comparison of the relative
performance of three face recognition algorithms on infrared and visible images.)
In this experiment, Wilder et al. measured recognition performance of three
algorithms on both IR and visible images; these algorithms were matching
pursuit filters, gray scale projection [37], and principal component analysis
[14]. The goal of this study was not to compare performance between
algorithms, but to measure their relative performance on the two modalities.

For the study, infrared and visible images of 101 subjects were collected.
The infrared images were acquired with a Texas Instruments SMRTII
uncooled sensor, which detects radiated heat. The visible images were
collected with a CIDTEC 2250 CID camera. Figure 14 shows infrared and
visible images of a subject. For an accurate measure of recognition
performance, the images were scaled and rotated into a standard position. In both
modalities, all algorithms were run on two sets of images. The results are
reported in table 3. For each run, the table reports the percentage of correct
answers in the top 1 and 2 matches.


Table 2. Performance as number of facial features is reduced.

              Correctly identified faces (%)
Top match     All features     Eye and nose     Left eye & nose     Eye & bridge
1             95.4             91.6             91.6                90.7
2             97.4             93.4             94.5                92.3
3             98.1             95.8             96.1                93.6
4             98.4             96.8             96.8                93.9

Top match     Left & right eyes     Tip of nose & left eye     Left eye only     Nose only     Face only
1             87.1                  89.4                       81.4              80.1          78.5
2             88.7                  91.3                       85.5              84.6          82.0
3             88.7                  93.2                       86.8              85.9          85.2
4             88.7                  93.4                       88.4              87.1          87.5
Figure 14. Images of same person in (a) infrared and (b) visible.

Table 3. Comparison of recognition of infrared and visible images in a
database of 101 individuals.

                   Correctly identified faces (%)
                   Gray scale projection     Eigenface                Matching pursuit filters
Image set          Top match  Top 2 matches  Top match  Top 2 matches  Top match  Top 2 matches
Infrared set 1     91         94             86         92             89         94
Infrared set 2     90         93             86         89             94         96
Visible set 1      84         91             86         93             94         96
Visible set 2      83         90             89         94             96         98

If  one  has  images  of  a  person  in  two  modalities,  the  simplest  method  of 
improving  performance  is  to  fuse  the  result  from  each  modality.  Each  of  the 
algorithms  in  this  experiment  produces  a  numeric  score  of  the  similarity 
between  an  unknown  face  and  each  image  in  the  gallery.  The  identity  is 
then  determined  by  the  best  similarity  score  between  the  probe  and  gallery 
images.  In  this  experiment,  I  fuse  the  two  modalities  by  linear  pooling  of 
the  similarity  scores  from  the  individual  modalities.  This  was  done  for  both 
image  sets  1  and  2  (see  table  3),  with  the  results  in  table  4.  In  all  cases,  fusing 
the  results  improves  performance  to  the  saturation  point.  This  suggests  that 
a  future  area  of  research  is  studying  methods  of  fusing  infrared  and  visible 
imagery. 
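
Linear pooling of similarity scores can be sketched as a weighted sum of the per-gallery scores from the two modalities. The report does not state the pooling weights, so the equal weighting used by default below is an assumption, as is the premise that the two score sets are on comparable scales and that higher scores mean greater similarity.

```python
import numpy as np

def fuse_and_rank(ir_scores, vis_scores, w_ir=0.5):
    """Fuse two sets of similarity scores for one probe by linear pooling.

    ir_scores, vis_scores: similarity of the probe to each gallery entry.
    w_ir: weight of the infrared scores (0.5 = equal weighting, assumed).
    Returns (gallery indices ordered best first, fused scores).
    """
    ir = np.asarray(ir_scores, dtype=float)
    vis = np.asarray(vis_scores, dtype=float)
    fused = w_ir * ir + (1.0 - w_ir) * vis
    order = np.argsort(-fused)           # descending fused similarity
    return order, fused
```

In this toy usage, gallery entry 0 wins on infrared alone and entry 1 on visible alone, and the pooled score decides between them:

`fuse_and_rank([0.9, 0.8, 0.1], [0.3, 0.9, 0.2])` ranks entry 1 first.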
Table 4. Face recognition when data from infrared and visible images are
fused (database of 101 individuals).

              Correctly identified faces (%)
              Gray scale projection     Eigenface                Matching pursuit filters
Image set     Top match  Top 2 matches  Top match  Top 2 matches  Top match  Top 2 matches
1             98         99             97         99             99         99
2             97         98             96         99             98         99

4.4  Mugshot  Gallery 

One potential application for a face identification algorithm is the
electronic mugbook. In an electronic mugbook, the gallery consists of digital
mugshots of known people, and a probe is a digital mugshot of an
individual to be identified. In this experiment, I ran the algorithm on mugshot data
provided by a law enforcement agency.

In the experiment, the gallery and probe sets were made up from
digital mugshots of 2175 persons with two frontal images per person (side
mugshots were not collected). The gallery was made up of one image from
each pair, for 2175 persons; the probe set consisted of the other paired
image, for 2126 individuals. The two images of each person were taken within
a few minutes of each other; however, the subjects were not necessarily
cooperative.

To obtain a performance baseline, I ran an eigenface implementation on the
dataset. In this implementation, the images were placed in a standard
position and masked; the pixel values inside the mask were then processed
by a histogram equalization algorithm. The eigenfaces were trained on a
subset of 500 images, and the faces were represented by the first 200
eigenvectors. For identification, the $L_1$ metric was used to measure the similarity
between faces.

Two versions of the matching pursuit filter algorithm were run. In the first
version, the similarity measure between coefficient vectors was the angle
between vectors; in the second, the $L_1$ metric was the similarity measure.
To assess the ability of matching pursuit filters to generalize across datasets,
I used the expansions from the visible images in the IR/visible experiment
(sect. 4.3).
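
The two measures differ in how they treat vector magnitude, which a short sketch makes visible. The function names are illustrative, and the code only shows the generic definitions of the cosine of the angle and of the $L_1$ distance, not the implementation used in the experiment.

```python
import numpy as np

def angle_similarity(a, b):
    """Cosine of the angle between two coefficient vectors (larger = more similar)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l1_distance(a, b):
    """Sum of absolute coefficient differences (smaller = more similar)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.abs(a - b).sum())
```

Two vectors that point in the same direction but differ in length, such as (1, 2) and (2, 4), are a perfect match under the angle measure but are separated by an $L_1$ distance of 3, so the $L_1$ metric also uses the magnitude of the coefficients.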

The results are presented in figure 15 on a cumulative match plot. The
x-axis is the rank of the ordering of the gallery from a match with a probe.
The y-axis is the fraction of the probes correctly identified. The plot
reports the fraction of the probes for which the correct answer is in the top
n matches. For example, for the $L_1$ version of the matching pursuit
algorithm, the correct match between a probe and gallery is in the top 10 for
0.95 of the probes. The results show that matching pursuit performs better
than eigenfaces, and that the $L_1$ metric is better for matching pursuit than
the angle between coefficient vectors. This experiment shows that matching
pursuit filters can generalize across data sets and to larger datasets.

Figure 15. Performance of MPF and eigenfaces on mugshot gallery. Gallery
size = 2175, probe set size = 2126. (Cumulative match plot with three curves:
matching pursuit filters with the $L_1$ metric, matching pursuit filters with
the angle metric, and the eigenface algorithm.)
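
A cumulative match curve of the kind described in this section can be computed from a matrix of probe-versus-gallery similarity scores. The sketch below assumes that higher scores mean greater similarity (a distance such as $L_1$ can be negated first) and that the correct gallery entry for each probe is known; the function name and the tie-handling rule are choices made for this illustration.

```python
import numpy as np

def cumulative_match(similarity, true_index):
    """Cumulative match scores.

    similarity: (P, G) array, similarity of each probe to each gallery entry.
    true_index: (P,) array, gallery index of the correct match for each probe.
    Returns an array cmc of length G, where cmc[n - 1] is the fraction of
    probes whose correct match is among the top n gallery entries.
    """
    similarity = np.asarray(similarity, dtype=float)
    true_index = np.asarray(true_index)
    n_probes, n_gallery = similarity.shape
    true_scores = similarity[np.arange(n_probes), true_index]
    # Rank of the correct match: 1 + number of entries scoring strictly higher.
    ranks = 1 + np.sum(similarity > true_scores[:, None], axis=1)
    counts = np.bincount(ranks, minlength=n_gallery + 1)[1:]
    return np.cumsum(counts) / n_probes
```

Reading the curve at rank 10, for example, gives the fraction of probes whose correct match is in the top 10.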
5. Conclusion


I describe a face identification algorithm that automatically locates facial
features and identifies the located faces. The algorithm is based on a new
class of filters called matching pursuit filters, which are wavelet expansions
that are both data and problem adaptive. This means that matching pursuit
filters are explicitly designed to solve the pattern recognition problems
encountered in face recognition: locating faces and distinguishing between
faces. Thus, two sets of filters were constructed: one set for the detection of
facial features and one set for the identification of faces.

The algorithm was run on images from the FERET database and an associated
database of visible and infrared images. The results of the visible and
infrared study show that the performance of algorithms based on matching
pursuit filters is comparable for both modalities. When the results of
both modalities are fused, performance saturates (>97 percent correct
identification). A larger database of infrared and visible images is
required for an accurate assessment of the capabilities of multi-modal
algorithms.
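
One simple way to fuse two modalities is at the score level. The sketch
below is an illustration only: this section does not restate the fusion
rule used in the study, so the min-max normalization, the sum rule, and
all names and numbers here are assumptions.

```python
def min_max_normalize(scores):
    """Rescale a list of distances to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_and_identify(visible_dist, infrared_dist, gallery_ids):
    """Sum the normalized distances from both modalities (assumed rule)
    and return the gallery identity with the smallest fused distance."""
    fused = [
        v + i
        for v, i in zip(min_max_normalize(visible_dist),
                        min_max_normalize(infrared_dist))
    ]
    best = min(range(len(fused)), key=fused.__getitem__)
    return gallery_ids[best], fused


# Made-up distances from one probe to a three-person gallery.
gallery_ids = ["a", "b", "c"]
visible = [0.40, 0.35, 0.90]    # visible alone slightly prefers "b"
infrared = [0.10, 0.60, 0.70]   # infrared alone prefers "a"

identity, fused = fuse_and_identify(visible, infrared, gallery_ids)
```

In this toy case the visible distances alone would pick "b", but the
fused distances pick "a", which shows how a second modality can correct
a narrow error in the first.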

A number of algorithms in the literature report performance figures from
tests run on the FERET database. Within this group, I restrict comparison
to those algorithms that automatically process probe images. Each of
these algorithms uses different training and test sets, so the results
quoted provide only an indirect comparison among the algorithms. The
algorithm given here does better than the benchmark algorithm of Gutta et
al. [28] (83 percent on a gallery of 200) and the correlation-based
algorithm of Gordon [27] (72 percent on a gallery of 194). The
performance of the algorithm described here is comparable to that of the
eigenspace algorithm of Moghaddam and Pentland [30] (99.4 percent on a
gallery of 150) and the dynamic link architecture algorithm of Wiskott et
al. [22] (97.3 percent on a gallery of 300).

The eigenspace algorithm represents a face with 50 to 100 coefficients in
a global encoding, which makes the algorithm faster than it would be if
local encoding were used. A global encoding cannot, however, explicitly
account for local deformations. In contrast, dynamic link architecture
explicitly handles local deformations with an elastic graph, but it
requires ~4000 coefficients to represent a face. The method presented
here is a compromise between these two methods. Each person is
represented by 150 coefficients (30 coefficients for each of the five
features).

In the present algorithm, the face is modeled as five geometric regions,
which gives the algorithm the flexibility to adjust to deformations and
variations in faces. I show this by measuring performance as the number
of features is decreased. With all five features, the identification rate
is 95.4 percent; with only one feature, it decreases to around 80 percent.
The degradation in performance is graceful, with the greatest drop
occurring when the number of features is reduced from two to one.
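
The feature-based representation and the reduced-feature test can be
sketched as follows. The sizes (five features, 30 coefficients each) come
from the text; everything else is an assumption for illustration: the
per-feature L1 sum, the function names, the choice to drop features in
index order, and the synthetic random faces.

```python
import random

NUM_FEATURES = 5          # from the text: five facial features
COEFFS_PER_FEATURE = 30   # from the text: 30 coefficients per feature


def face_distance(face_a, face_b, features):
    """Sum of per-feature L1 distances over the selected feature indices."""
    return sum(
        abs(x - y)
        for f in features
        for x, y in zip(face_a[f], face_b[f])
    )


def identify(probe, gallery, features):
    """Return the gallery name closest to the probe on the given features."""
    return min(gallery, key=lambda name: face_distance(probe, gallery[name], features))


def random_face(rng):
    """Synthetic face: NUM_FEATURES lists of COEFFS_PER_FEATURE coefficients."""
    return [[rng.gauss(0.0, 1.0) for _ in range(COEFFS_PER_FEATURE)]
            for _ in range(NUM_FEATURES)]


def perturb(face, rng, sigma=0.1):
    """Noisy copy of a face, standing in for a second image of the person."""
    return [[c + rng.gauss(0.0, sigma) for c in feature] for feature in face]


rng = random.Random(0)
gallery = {f"person_{i}": random_face(rng) for i in range(20)}
probes = {name: perturb(face, rng) for name, face in gallery.items()}

# Identification rate as the number of features used is decreased.
rates = {}
for k in range(NUM_FEATURES, 0, -1):
    features = range(k)   # keep the first k features (arbitrary choice)
    correct = sum(identify(p, gallery, features) == name
                  for name, p in probes.items())
    rates[k] = correct / len(probes)

coeffs_per_face = sum(len(feature) for feature in gallery["person_0"])
```

The synthetic faces are far easier to separate than real images, so this
sketch shows only the mechanics of the test and does not reproduce the
rates reported above. The design point it illustrates is that a local,
per-feature encoding lets the matcher drop or ignore a feature without
recomputing the remaining coefficients.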

I have successfully demonstrated the performance of this algorithm in a
number of experiments (on a gallery of 311, with infrared versus visible
images, on a gallery of 2175 mugshots, and as the number of features
decreases). This success shows that face identification algorithms based
on matching pursuit filters are viable and can serve as a basis for a
practical face identification system.